Abstract

In record linkage (RL), or exact file matching, the goal is to identify the links between entities 
with information on two or more files. RL is an important activity in areas including counting the 
population, enhancing survey frames and data, and conducting epidemiological and follow-up stud- 
ies. RL is challenging when files are very large, no accurate personal identification (ID) number is 
present on all files for all units, and some information is recorded with error. Without a unique ID
number one must rely on comparisons of names, addresses, dates, and other information to find the 
links. Latent class models can be used to automatically score the value of information for determining 
match status. Data for fitting models come from comparisons made within groups of units that pass 
initial file blocking requirements. Data distributions can vary across blocks. This article examines 
the use of prior information and hierarchical latent class models in the context of RL. 

Key Words: Fellegi-Sunter, Gibbs sampling, Hierarchical model, Latent class model, Metropolis- 
Hastings algorithm, Mixture model. 

1 Introduction 

Record linkage (RL; Fellegi and Sunter 1969) is the name given to the activity of using information 
in two or more data bases to identify units or individuals represented in more than one of the files. 
One performs RL when merging files to avoid duplicate records and to correctly associate information 
present on the two or more files with unique individuals. RL can be used as part of population size 
estimation, survey frame and survey data enhancement, and epidemiological and longitudinal studies. In 
general when there are multiple files on a single population one could consider using RL for a number 
of activities. Larsen (2012) reviews some literature on the subject. See also Winkler (1995), Alvey and 
Jamerson (1997), and Herzog, Scheuren, and Winkler (2007). 

When files are large, accurate and unique personal identification (ID) numbers are not present on all
files for all units, and there are errors in some information, RL can be nontrivial. Without a unique
ID number one must compare names, addresses, dates, and other information to find the links between
files. The outcome of comparisons between two records, one on each of two files, is called a vector of
comparisons, or comparison vector. Typically comparisons are only made for records that pass initial
file blocking requirements. The pairs that fall outside a block are assumed to be nonmatches.

Once a comparison vector is computed, one must decide how much evidence the set of comparisons
gives in favor of the pair being a match. In some applications, predetermined scoring procedures are used.
As an alternative, one could imagine taking a sample of cases, determining whether the pairs truly are 
matches, and fitting a statistical model to the results. Such an approach is a type of supervised learning 
similar in nature to discriminant analysis or logistic regression. Latent class models, which are a type
of mixture model, can be fit and used to automatically score the value of information for determining
match status. Latent class analysis (LCA) clusters the data and is a form of unsupervised learning. Data
for fitting models come from comparisons made within groups of units that pass initial file blocking 
requirements. LCA has been used in RL by Larsen (2012), Lahiri and Larsen (2005), Larsen and Rubin 
(2001), Winkler (1988, 1994), Jaro (1989, 1995), and others. 

This article examines the use of prior information and hierarchical latent class models in the context 
of RL. First, the impact of an informative prior distribution developed from record linkage operations 
similar to the current one is studied. Second, since data distributions can vary across blocks, a hierarchi- 
cal latent class model is used to account for inter-block heterogeneity. Both developments are studied 
through simulation and compared to LCA in terms of RL error rates. 

Section 2 presents statistical models for record linkage. Section 3.2 discusses computational
algorithms. Section 4 reports on a simulation study. Section 5 gives a summary and conclusions.

2 Record Linkage Latent Class Models 

Suppose that there are two files, A and B, on a single population. Consider record a in file A and record 
b in B. Do records a and b correspond to the same person or entity? Assume files A and B do not contain 
unique identification numbers for any units in the files. Variables in the two files are used to judge the 
similarity of the record pairs. To do so one defines agreement for each piece of information common 
to both files. In a household-based study, variables can include last name currently and at birth, middle 
name or initial, first name, house and unit number, street name, age or date of birth, sex, race/ethnicity, 
and relation to head of household. Files often are preprocessed before linkage is attempted. For example, 
names can be standardized and coded according to Soundex codes or another scheme. Names and address
fields are parsed and standardized. Birth date can be separated into day, month, and year. 

In the case of simple comparisons, for each pair of records $(a, b)$ being considered, a vector of 1's
and 0's indicating agreement and disagreement on $K$ comparison fields is recorded. That is, for $a \in A$
and $b \in B$, define

$$\gamma(a, b) = \{\gamma_1(a, b), \gamma_2(a, b), \ldots, \gamma_K(a, b)\},$$

where $\gamma_k(a, b)$ equals 1 (agreement) or 0 (disagreement) on field $k$, $k = 1, \ldots, K$. Heuristically
speaking, many agreements ($\gamma(a, b)$ mostly 1's) are typical of matches, whereas many disagreements
($\gamma(a, b)$ mostly 0's) are typical of nonmatches. Some variables (e.g., race) are informative in some
locations regarding matches and nonmatches, but not in others. Disagreement on sex suggests a nonmatch,
whereas agreement on sex is not persuasive by itself for being a match.

In this type of record linkage no one variable conclusively determines if a pair is a match or
nonmatch. Rather it is the composite evidence that must be judged.

In census and other operations, the files are divided geographically into groups of records or 'blocks'
that do not overlap. Blocking is used in other applications as well in order to reduce the number of 
record pairs being compared. It is assumed that there are no (or very few) matches across different 
blocks. Other operations use first letter of last name (individuals) or industry code (businesses) or state 
as blocking variables. 

Let blocks be indexed by $s = 1, \ldots, S$. Suppose that file A has $n_{as}$ records and file B has $n_{bs}$ records,
respectively, in block $s$. For blocks $s = 1, \ldots, S$, $a_s = 1, \ldots, n_{as}$, and $b_s = 1, \ldots, n_{bs}$, define

$$I(a_s, b_s) = \begin{cases} 1 & \text{if } a_s \text{ and } b_s \text{ are matches} \\ 0 & \text{if } a_s \text{ and } b_s \text{ are nonmatches.} \end{cases}$$

The set of match-nonmatch indicators in block $s$ is $I_s = \{I(a_s, b_s)\}$.

The match/nonmatch indicators $I = \{I(a, b), a \in A_s, b \in B_s, s = 1, \ldots, S\}$ are unobserved unless
clerical review or some verification system is used.
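
As a minimal sketch of the bookkeeping described above, the following Python fragment computes binary comparison vectors only for record pairs that share a block. The record layout, field names, and exact-equality agreement rule are assumptions made for illustration; real applications typically standardize fields first and may use approximate string comparison.

```python
from itertools import product

# Hypothetical records: (block id, last name, first name, birth year, sex).
file_a = [
    ("s1", "SMITH", "JOHN", 1970, "M"),
    ("s1", "JONES", "MARY", 1985, "F"),
    ("s2", "BROWN", "ANN", 1990, "F"),
]
file_b = [
    ("s1", "SMITH", "JON", 1970, "M"),
    ("s1", "JONES", "MARY", 1958, "F"),
    ("s2", "BROWN", "ANN", 1990, "F"),
]

def comparison_vector(rec_a, rec_b):
    """Return gamma(a, b): 1 for exact agreement, 0 otherwise, per field."""
    return tuple(int(x == y) for x, y in zip(rec_a[1:], rec_b[1:]))

def compare_within_blocks(recs_a, recs_b):
    """Compute comparison vectors only for pairs that pass blocking."""
    out = {}
    for (i, a), (j, b) in product(enumerate(recs_a), enumerate(recs_b)):
        if a[0] == b[0]:  # blocking requirement: same block id
            out[(i, j)] = comparison_vector(a, b)
    return out

gammas = compare_within_blocks(file_a, file_b)
```

Pairs from different blocks never enter `gammas`, which mirrors the assumption that such pairs are nonmatches.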



2.1 Latent Class Models 

The mixture model (McLachlan and Peel 2000) approach to record linkage models the probability of a
comparison vector $\gamma$ as arising from a mixture distribution:

$$\Pr(\gamma) = \Pr(\gamma \mid M)\, p_M + \Pr(\gamma \mid U)\, p_U, \qquad (1)$$

where $\Pr(\gamma \mid M)$ and $\Pr(\gamma \mid U)$ are the probabilities of the pattern $\gamma$ among the matches ($M$) and
nonmatches ($U$), respectively, and $p_M$ and $p_U = 1 - p_M$ are marginal probabilities of matches and unmatched
pairs.

The conditional independence assumption of latent class models (McCutcheon 1987) simplifies the
model by reducing the dimension within each mixture class from $2^K - 1$ parameters to $K$:

$$\Pr(\gamma \mid C) = \prod_{k=1}^{K} \Pr(\gamma_k = 1 \mid C)^{\gamma_k} \, (1 - \Pr(\gamma_k = 1 \mid C))^{1 - \gamma_k}, \qquad (2)$$

with $C \in \{M, U\}$. Interactions between comparison fields have been allowed in Larsen and Rubin
(2001), Armstrong and Mayda (1993), Thibaudeau (1993), Winkler (1989), and others. Here we consider
only the conditional independence (CI) model and extensions of it to a hierarchical framework. The CI
assumption reduces the number of parameters needed to describe $\Pr(\gamma)$ from $2^K - 1$ to $2K + 1$. Maximum
likelihood estimation of parameters is accomplished with the EM algorithm (Dempster, Laird, and Rubin
1977).
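
The EM fit of the two-class conditional independence model can be sketched as follows. This is an illustrative implementation under stated assumptions, not the code used for the article: the function name `em_latent_class`, the starting values, the fixed iteration count, and the well-separated demonstration data are all hypothetical choices.

```python
import numpy as np

def em_latent_class(gamma, p_m=0.1, m=None, u=None, n_iter=200):
    """Fit the two-class conditional independence mixture by EM.

    gamma: (n_pairs, K) array of 0/1 agreement indicators.
    Returns (p_m, m, u, w): m[k] and u[k] estimate Pr(gamma_k = 1 | M)
    and Pr(gamma_k = 1 | U); w[i] is the posterior match probability of
    pair i from the final E-step.
    """
    gamma = np.asarray(gamma, dtype=float)
    n, K = gamma.shape
    m = np.full(K, 0.8) if m is None else np.asarray(m, dtype=float)
    u = np.full(K, 0.3) if u is None else np.asarray(u, dtype=float)
    eps = 1e-9  # keeps probabilities away from 0 and 1
    for _ in range(n_iter):
        # E-step: posterior probability of class M for each pair.
        log_m = np.log(p_m) + gamma @ np.log(m) + (1 - gamma) @ np.log(1 - m)
        log_u = np.log(1 - p_m) + gamma @ np.log(u) + (1 - gamma) @ np.log(1 - u)
        w = 1.0 / (1.0 + np.exp(np.clip(log_u - log_m, -700, 700)))
        # M-step: weighted proportions.
        p_m = np.clip(w.mean(), eps, 1 - eps)
        m = np.clip(w @ gamma / w.sum(), eps, 1 - eps)
        u = np.clip((1 - w) @ gamma / (1 - w).sum(), eps, 1 - eps)
    return p_m, m, u, w

# Hypothetical, well-separated demonstration data (not the article's design).
rng = np.random.default_rng(0)
is_match = rng.random(2000) < 0.2
agree_prob = np.where(is_match[:, None], 0.95, 0.05)
demo_gamma = (rng.random((2000, 5)) < agree_prob).astype(int)
p_hat, m_hat, u_hat, w_hat = em_latent_class(demo_gamma)
```

Because the likelihood is invariant to swapping the class labels, the starting values place the higher agreement probabilities in the match class; which class is "M" after fitting should still be checked.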

By Bayes' theorem, if the parameters were known and one does not consider restrictions from
one-to-one matching, one could calculate for a pair $(a, b)$ the probability that $a$ and $b$ match:

$$\Pr(I(a, b) = 1 \mid \gamma(a, b)) = \Pr(M \mid \gamma(a, b)) = p_M \Pr(\gamma(a, b) \mid M) / \Pr(\gamma(a, b)), \qquad (3)$$

with the denominator given by (1).

2.2 Prior Information and Bayesian Latent Class Model

Experience from previous record linkage operations has been used informally to select models (Larsen 
and Rubin 2001) and restrict parameters (Winkler 1989, 1994). Bayesian approaches to record linkage 
have been suggested by Larsen (1999a, 2002, 2004, 2005), Fortini et al. (2002, 2000), and McGlinchy 
(2004). 

Prior experience and data often are available from previous record linkage operations and sites. In 
previous record linkage studies, clerks at the U.S. Census Bureau looked at record pairs and determined 
whether or not they truly were nonmatches or matches. Belin (1993), Belin and Rubin (1995), Larsen 
(1999b), and Larsen and Rubin (2001) found that in some U.S. Census Bureau record linkage applica- 
tions characteristics of populations being studied varied by area in ways that made a significant impact 
on estimates of parameters needed for record linkage. There were, however, consistent patterns across 
areas. The percentage of record pairs, one record from each of two files, under consideration that
actually are matches corresponding to the same person is roughly similar across sites. The probabilities
of agreeing on some key fields of information among matches and nonmatches are similar across sites.
The probabilities of agreement are higher among matches than among nonmatches. There is, however,
variability across sites in these and many other characteristics.

Assuming the conditional independence model (2) and global parameters that do not vary by block,
a prior distribution on parameters can be specified conveniently as the product of independent Beta
distributions as follows:

$$p_M \sim \text{Beta}(\alpha_M, \beta_M),$$
$$\Pr(\gamma_k(a, b) = 1 \mid M) \sim \text{Beta}(\alpha_{Mk}, \beta_{Mk}), \quad k = 1, \ldots, K,$$

and

$$\Pr(\gamma_k(a, b) = 1 \mid U) \sim \text{Beta}(\alpha_{Uk}, \beta_{Uk}), \quad k = 1, \ldots, K.$$

Instead of specifying the prior distribution in this manner, it would conceptually be possible to specify
a prior distribution on the whole of the probability vector associated with the set of comparison vectors
$\gamma$ as two Dirichlet distributions. That is, independent prior distributions $\Pr(\gamma \mid M) \sim \text{Dirichlet}(\delta_M)$ and
$\Pr(\gamma \mid U) \sim \text{Dirichlet}(\delta_U)$ could be specified. This option is not explored in this paper. It is noted,
however, that pairs of records with known match status could be used as "training data" (as in Belin and
Rubin 1995) for the purposes of specifying a prior distribution. The prior parameter values, $\delta_M$ and $\delta_U$,
could be considered as 'prior counts' by agreement vector pattern in the matches and nonmatches.

If the match indicators $I$ were known, the posterior distributions of individual parameters given values
of the other parameters would be as follows:

$$p_M \mid I \sim \text{Beta}\Big(\alpha_M + \sum_{(a,b)} I(a, b),\; \beta_M + \sum_{(a,b)} (1 - I(a, b))\Big), \qquad (4)$$

$$\Pr(\gamma_k(a, b) = 1 \mid M, I) \sim \text{Beta}\Big(\alpha_{Mk} + \sum I_{ab}\, \gamma_k(a, b),\; \beta_{Mk} + \sum I_{ab}\, (1 - \gamma_k(a, b))\Big) \qquad (5)$$

for $k = 1, \ldots, K$, and

$$\Pr(\gamma_k(a, b) = 1 \mid U, I) \sim \text{Beta}\Big(\alpha_{Uk} + \sum (1 - I_{ab})\, \gamma_k(a, b),\; \beta_{Uk} + \sum (1 - I_{ab})(1 - \gamma_k(a, b))\Big) \qquad (6)$$

for $k = 1, \ldots, K$, where $I_{ab} = I(a, b)$ and sums are over all pairs allowed within the blocking structure.

2.3 Computing for the Bayesian Latent Class Model 

The posterior distribution of parameters is simulated by sampling from alternating conditional distribu- 
tions (Gibbs sampling; Geman and Geman 1984, Gelfand and Smith 1990) as follows. 

1. Specify parameters for the prior distributions. Choose initial values of unknown parameters.

2. Repeat the following four steps numerous times until the distribution of draws has converged to
the posterior distribution of interest.

(a) Draw values for the components of $I$ independently from Bernoulli distributions with the
probability that $I(a, b) = 1$ given by formula (3).

(b) Draw a value of $p_M$ from the distribution specified in formula (4) and calculate $p_U = 1 - p_M$.

(c) Draw values of $\Pr(\gamma_k(a, b) = 1 \mid M, I)$ independently for $k = 1, \ldots, K$ from distributions
specified in formula (5).

(d) Draw values of $\Pr(\gamma_k(a, b) = 1 \mid U, I)$ independently for $k = 1, \ldots, K$ from distributions
specified in formula (6).

3. Stop once the algorithm has converged. Convergence of the algorithm can be monitored by com- 
paring distributions from multiple independent series as suggested by Gelman and Rubin (1992) 
and Brooks and Gelman (1998). 

Once the algorithm has converged, it is necessary to decide which pairs of records to designate as links
and nonlinks and which to send to clerical review or leave undecided. One can calculate the proportion
of draws in which a record pair $(a, b)$ has $I(a, b) = 1$. For record pairs with a proportion exceeding a
cutoff, such as 0.90, one can assign the pair to the match group. Larsen (2012) examined the
impact of cutoff values.
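
The sampling cycle and the final cutoff rule can be sketched in Python as below. This is a simplified illustration rather than the article's implementation: the function `gibbs_blcm`, the dictionary layout of the prior, the burn-in length, and the simulated demonstration data are assumptions, and the parameter restrictions discussed in the next subsection are not enforced.

```python
import numpy as np

def gibbs_blcm(gamma, prior, n_draws=300, burn_in=100, seed=0):
    """Gibbs sampler for the Bayesian latent class model (global parameters).

    prior: dict with Beta parameters a_M, b_M (scalars) and a_Mk, b_Mk,
    a_Uk, b_Uk (length-K arrays). Returns the retained draws of p_M and,
    for each pair, the share of retained draws with I(a, b) = 1.
    """
    rng = np.random.default_rng(seed)
    gamma = np.asarray(gamma, dtype=float)
    n, K = gamma.shape
    clip = lambda x: np.clip(x, 1e-12, 1 - 1e-12)
    p_m = clip(rng.beta(prior["a_M"], prior["b_M"]))
    m = clip(rng.beta(prior["a_Mk"], prior["b_Mk"]))
    u = clip(rng.beta(prior["a_Uk"], prior["b_Uk"]))
    p_draws, match_count = [], np.zeros(n)
    for it in range(n_draws):
        # (a) indicators: Bernoulli with the probability in formula (3)
        log_m = np.log(p_m) + gamma @ np.log(m) + (1 - gamma) @ np.log1p(-m)
        log_u = np.log1p(-p_m) + gamma @ np.log(u) + (1 - gamma) @ np.log1p(-u)
        prob = 1.0 / (1.0 + np.exp(np.clip(log_u - log_m, -700, 700)))
        ind = rng.random(n) < prob
        n_m = ind.sum()
        # (b) p_M from formula (4)
        p_m = clip(rng.beta(prior["a_M"] + n_m, prior["b_M"] + n - n_m))
        # (c) agreement probabilities among matches, formula (5)
        agree_m = gamma[ind].sum(axis=0)
        m = clip(rng.beta(prior["a_Mk"] + agree_m, prior["b_Mk"] + n_m - agree_m))
        # (d) agreement probabilities among nonmatches, formula (6)
        agree_u = gamma[~ind].sum(axis=0)
        u = clip(rng.beta(prior["a_Uk"] + agree_u,
                          prior["b_Uk"] + (n - n_m) - agree_u))
        if it >= burn_in:
            p_draws.append(p_m)
            match_count += ind
    return np.array(p_draws), match_count / (n_draws - burn_in)

# Hypothetical demonstration data and prior values.
rng = np.random.default_rng(1)
K = 7
is_match = rng.random(3000) < 0.03
agree_prob = np.where(is_match[:, None], 0.90, 0.25)
demo_gamma = (rng.random((3000, K)) < agree_prob).astype(int)
prior = {"a_M": 35.0, "b_M": 1128.0,
         "a_Mk": np.full(K, 22.0), "b_Mk": np.full(K, 5.4),
         "a_Uk": np.full(K, 8.1), "b_Uk": np.full(K, 24.2)}
p_draws, match_share = gibbs_blcm(demo_gamma, prior)
links = match_share > 0.90  # designate links using a cutoff
```

A fixed burn-in is used here only to keep the sketch short; in practice convergence would be assessed with multiple independent series as described above.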

2.4 Comments on Bayesian Latent Class Model 

There are some restrictions on parameters that potentially could improve the performance of this model
for record linkage. First, the range of $p_M$ logically should be restricted to be less than or equal to the
smaller of the two file sizes divided by the number of pairs under the blocking structure. When $p_M$
is drawn in the Gibbs sampling algorithm from its conditional distribution, values of $p_M$ greater than
the cutoff should not be used. Alternatively, if $p_M = c_M p'_M$, where $p'_M$ has the Beta distribution given
above and $c_M < 1$ is a scale factor appropriate for transforming $p'_M$ to the allowable range of $p_M$, one can
sample $p'_M$ and scale it by $c_M$. Second, logically the probability of a record pair agreeing on a comparison
field should be larger among matches than among nonmatches. That is, $\Pr(\gamma_k = 1 \mid M) > \Pr(\gamma_k = 1 \mid U)$, for
$k = 1, \ldots, K$. Such a restriction can be added to the Gibbs sampling algorithm by simply ignoring
sampled pairs of these probabilities that do not satisfy the constraint. Alternatively, one can draw one
value, say $\Pr(\gamma_k = 1 \mid M)$, and scale the value of $\Pr(\gamma_k = 1 \mid U)$ to be in the range $(0, \Pr(\gamma_k = 1 \mid M))$. That is, after
drawing a value of $\Pr(\gamma_k = 1 \mid M)$, draw a value from the Beta distribution specified in the algorithm and
multiply it by $\Pr(\gamma_k = 1 \mid M)$.
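
The two ways of imposing these restrictions can be sketched as follows. The function names and the Beta parameter values are hypothetical; note that the rejection and scaling approaches generally imply different distributions for the constrained parameters, so they are alternatives rather than equivalents.

```python
import numpy as np

def draw_p_m_scaled(a, b, n_a, n_b, n_pairs, rng):
    """Draw p'_M ~ Beta(a, b) and scale it into the allowable range of p_M."""
    c_m = min(n_a, n_b) / n_pairs  # upper bound on the match proportion
    return c_m * rng.beta(a, b)

def draw_ordered_by_rejection(a_m, b_m, a_u, b_u, rng, max_tries=10_000):
    """Redraw both probabilities until Pr(agree | M) > Pr(agree | U)."""
    for _ in range(max_tries):
        p_agree_m = rng.beta(a_m, b_m)
        p_agree_u = rng.beta(a_u, b_u)
        if p_agree_m > p_agree_u:
            return p_agree_m, p_agree_u
    raise RuntimeError("constraint not satisfied within max_tries")

def draw_ordered_by_scaling(a_m, b_m, a_u, b_u, rng):
    """Draw Pr(agree | M), then scale a Beta draw into (0, Pr(agree | M))."""
    p_agree_m = rng.beta(a_m, b_m)
    p_agree_u = p_agree_m * rng.beta(a_u, b_u)
    return p_agree_m, p_agree_u

rng = np.random.default_rng(2)
# Hypothetical values: one block-free setting with 25 x 25 = 625 pairs.
p_m = draw_p_m_scaled(35.0, 1128.0, n_a=25, n_b=25, n_pairs=625, rng=rng)
rej = draw_ordered_by_rejection(22.0, 5.4, 8.1, 24.2, rng)
scl = draw_ordered_by_scaling(22.0, 5.4, 8.1, 24.2, rng)
```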

Instead of specifying specific Beta prior distributions for latent class parameters, one could consider
specifying a hyperprior distribution, $p(\alpha, \beta)$, for the parameters of the Beta distributions, where
$\alpha = (\alpha_M, \alpha_{Mk}, \alpha_{Uk}, k = 1, \ldots, K)$ and $\beta = (\beta_M, \beta_{Mk}, \beta_{Uk}, k = 1, \ldots, K)$. One could first transform the
parameters to a scale reflecting proportions and sample sizes: $\theta = \text{logit}(\alpha / (\alpha + \beta))$ and $\tau = \log(\alpha + \beta)$, where
the transformation is applied to corresponding components of the $\alpha$ and $\beta$ vectors. Note that there is a
unique bivariate inverse transformation: $\alpha = e^{\tau}\, \text{logit}^{-1}(\theta)$ and $\beta = e^{\tau}\, (1 - \text{logit}^{-1}(\theta))$. The approach of
specifying a hyperprior distribution is not considered here, because such an approach implicitly makes
the assumption that the parameters are exchangeable. Such an assumption is unrealistic because variables
have very different levels of agreement for both matches and nonmatches.
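
The transformation between $(\alpha, \beta)$ and the proportion and sample-size scale can be written as a pair of small functions. The sketch below follows directly from the definitions $\theta = \text{logit}(\alpha/(\alpha+\beta))$ and $\tau = \log(\alpha+\beta)$, which give $\beta = e^{\tau}(1 - \text{logit}^{-1}(\theta))$; the function names are illustrative.

```python
import math

def to_theta_tau(alpha, beta):
    """Map Beta parameters to (theta, tau) = (logit of mean, log of size)."""
    theta = math.log(alpha / beta)   # logit(alpha / (alpha + beta))
    tau = math.log(alpha + beta)
    return theta, tau

def to_alpha_beta(theta, tau):
    """Inverse map: recover (alpha, beta) from (theta, tau)."""
    mean = 1.0 / (1.0 + math.exp(-theta))  # inverse logit
    size = math.exp(tau)
    return size * mean, size * (1.0 - mean)

theta, tau = to_theta_tau(22.0, 5.4)
alpha_back, beta_back = to_alpha_beta(theta, tau)
```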



3 Record Linkage and Hierarchical Latent Class Model 

A hierarchical model for record linkage specifies distributions of parameters within blocks s = 1, . . . , S. 
The probabilities of agreeing on fields of information are allowed to vary by block. Prior distributions 
on parameters are as follows: 

$$p_{sMk} = \Pr(\gamma_k = 1 \mid M, s) \sim \text{Beta}(\alpha_{sMk}, \beta_{sMk})$$

and

$$p_{sUk} = \Pr(\gamma_k = 1 \mid U, s) \sim \text{Beta}(\alpha_{sUk}, \beta_{sUk})$$

independently across blocks ($s = 1, \ldots, S$), fields ($k = 1, \ldots, K$), and classes ($M$ and $U$). As before,
one can impose the restriction that $p_{sMk} > p_{sUk}$.



3.1 Hyperprior distributions for the Hierarchical Model 

Hyperprior distributions are used in this model to link estimation of probabilities across blocks. Without 
a model that enables 'borrowing strength' across blocks, parameters are estimated separately by block and
data could be insufficient for accurate estimation. It is still the case that it is unlikely to be reasonable
to model the probability of matching and the probability of agreement for K fields as exchangeable. As 
a result, independent hyperprior distributions are used. One version was suggested by Larsen (2004). 
The specification below generalizes previous approaches by allowing correlation. Let 

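$$\theta_{sMk} = \text{logit}\left(\frac{\alpha_{sMk}}{\alpha_{sMk} + \beta_{sMk}}\right), \qquad \tau_{sMk} = \log(\alpha_{sMk} + \beta_{sMk}),$$

$$\theta_{sUk} = \text{logit}\left(\frac{\alpha_{sUk}}{\alpha_{sUk} + \beta_{sUk}}\right), \qquad \tau_{sUk} = \log(\alpha_{sUk} + \beta_{sUk}),$$

$$\theta_{sM} = \text{logit}\left(\frac{\alpha_{sM}}{\alpha_{sM} + \beta_{sM}}\right),$$

and

$$\tau_{sM} = \log(\alpha_{sM} + \beta_{sM}).$$

Then

$$(\theta_{sMk}, \tau_{sMk})' \sim N_2((\mu_{\theta Mk}, \mu_{\tau Mk})', \Sigma_{Mk}),$$

$$(\theta_{sUk}, \tau_{sUk})' \sim N_2((\mu_{\theta Uk}, \mu_{\tau Uk})', \Sigma_{Uk}),$$

and

$$(\theta_{sM}, \tau_{sM})' \sim N_2((\mu_{\theta M}, \mu_{\tau M})', \Sigma_{M}),$$

independently, where

$$\Sigma_{Mk} = \begin{pmatrix} \sigma^2_{\theta Mk} & \sigma_{\theta Mk, \tau Mk} \\ \sigma_{\theta Mk, \tau Mk} & \sigma^2_{\tau Mk} \end{pmatrix}, \qquad
\Sigma_{Uk} = \begin{pmatrix} \sigma^2_{\theta Uk} & \sigma_{\theta Uk, \tau Uk} \\ \sigma_{\theta Uk, \tau Uk} & \sigma^2_{\tau Uk} \end{pmatrix},$$

and

$$\Sigma_{M} = \begin{pmatrix} \sigma^2_{\theta M} & \sigma_{\theta M, \tau M} \\ \sigma_{\theta M, \tau M} & \sigma^2_{\tau M} \end{pmatrix}.$$
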



In the prior distributions, one could enforce the restriction that, for $k = 1, \ldots, K$, $\theta_{sMk} > \theta_{sUk}$.
Similarly the restriction that $p_{sM}$ is smaller than the minimum of $n_{as}$ and $n_{bs}$ divided by the number of
pairs $n_{as} n_{bs}$ likely will be useful. If it were not enforced, the small sample size and great variability
across blocks would surely produce poor results for some blocks.

3.2 Computing for the Hierarchical Latent Class Model 

The posterior distribution of parameters and unobserved match/nonmatch indicators will be simulated 
using Gibbs sampling. The conditional distributions for the hyperparameters will be sampled using 
the Metropolis-Hastings (MH) algorithm (Hastings 1970) within the Gibbs sampling framework. The 
procedure iterates through draws of full conditional distributions as described below. 

1. Choose hyperparameter distributions. That is, select values for means ($\mu_{\theta M}$, $\mu_{\theta Mk}$, $\mu_{\theta Uk}$, $\mu_{\tau M}$,
$\mu_{\tau Mk}$, and $\mu_{\tau Uk}$) and variance matrices ($\Sigma_M$, $\Sigma_{Mk}$, and $\Sigma_{Uk}$).

2. Generate initial values of $(\alpha_{sM}, \beta_{sM})$ and, for $k = 1, \ldots, K$, $(\alpha_{sMk}, \beta_{sMk})$ and $(\alpha_{sUk}, \beta_{sUk})$ from
their prior distributions.

3. Assign an initial match/nonmatch configuration $I$. One initialization option is to randomly generate
a matrix of 1's and 0's representing matches and nonmatches by block with the number of 1's per
block below the maximum allowable number of 1's. A possibly better way to initialize $I$ would
be to use the ordinary latent class model to assign matches and nonmatches at a high probability
cutoff.

4. Cycle through the following steps numerous times until the distribution of drawn values converges
to the target posterior distribution. Let $I_{ab}$ denote $I(a, b)$.

(a) For $s = 1, \ldots, S$, draw $p_{sM}$ from its conditional distribution given the current indicators $I_s$
and values of $(\alpha_{sM}, \beta_{sM})$. Specifically,

$$p_{sM} \mid I_s, \alpha_{sM}, \beta_{sM} \sim \text{Beta}\Big(\alpha_{sM} + \sum I_{ab},\; \beta_{sM} + n_{as} n_{bs} - \sum I_{ab}\Big),$$

where the sum is over all pairs $(a, b)$ in block $s$. Enforce the constraint

$$p_{sM} \le \min(n_{as}, n_{bs}) / (n_{as} n_{bs}).$$

(b) For $s = 1, \ldots, S$ and $k = 1, \ldots, K$, draw $p_{sMk}$ and $p_{sUk}$ from their conditional
distributions given the current indicators $I_s$, the comparison vectors $\gamma_s$ in block $s$, and values of
$(\alpha_{sCk}, \beta_{sCk})$, $C \in \{M, U\}$. Specifically,

$$p_{sMk} \mid I_s, \gamma_s, \alpha_{sMk}, \beta_{sMk} \sim \text{Beta}\Big(\alpha_{sMk} + \sum_s I_{ab}\, \gamma_k(a, b),\; \beta_{sMk} + \sum_s I_{ab}\, (1 - \gamma_k(a, b))\Big),$$

$$p_{sUk} \mid I_s, \gamma_s, \alpha_{sUk}, \beta_{sUk} \sim \text{Beta}\Big(\alpha_{sUk} + \sum_s (1 - I_{ab})\, \gamma_k(a, b),\; \beta_{sUk} + \sum_s (1 - I_{ab})(1 - \gamma_k(a, b))\Big),$$

and $p_{sMk} > p_{sUk}$, where sums are over all pairs $(a, b)$ in block $s$.

(c) For $s = 1, \ldots, S$, use the Metropolis-Hastings algorithm (Hastings 1970; see also Gelman
1992 and Gelman et al. 2004, chapter 11) to draw values of hyperparameters $\theta_{sM}$ and $\tau_{sM}$
from their full conditional distributions. Details of this step and the next two steps are given
after this outline.

(d) For $s = 1, \ldots, S$ and $k = 1, \ldots, K$, use the Metropolis-Hastings algorithm to draw values
of hyperparameters $\theta_{sMk}$ and $\tau_{sMk}$.

(e) For $s = 1, \ldots, S$ and $k = 1, \ldots, K$, use the Metropolis-Hastings algorithm to draw values
of hyperparameters $\theta_{sUk}$ and $\tau_{sUk}$.

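(f) For $s = 1, \ldots, S$, $a = 1, \ldots, n_{as}$, and $b = 1, \ldots, n_{bs}$, given values of $p_{sM}$ and, for
$k = 1, \ldots, K$, $p_{sMk}$ and $p_{sUk}$, draw a value of $I(a, b)$ from a Bernoulli distribution with the
following probability:

$$\frac{p_{sM} \prod_{k=1}^{K} p_{sMk}^{\gamma_k(a, b)} (1 - p_{sMk})^{1 - \gamma_k(a, b)}}{p_{sM} \prod_{k=1}^{K} p_{sMk}^{\gamma_k(a, b)} (1 - p_{sMk})^{1 - \gamma_k(a, b)} + (1 - p_{sM}) \prod_{k=1}^{K} p_{sUk}^{\gamma_k(a, b)} (1 - p_{sUk})^{1 - \gamma_k(a, b)}}.$$
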



5. Stop once the algorithm has converged. 

As before, once the algorithm has converged, it is necessary to decide which pairs of records to 
designate as links. Suggestions were made at the end of the previous section. 

Details of the three Metropolis-Hastings (Hastings 1970) steps in the simulation procedure above are 
now presented. 

(c). For $s = 1, \ldots, S$, use the Metropolis-Hastings algorithm to draw values of hyperparameters $\theta_{sM}$
and $\tau_{sM}$ from their full conditional distributions. Specifically, given current values of $\theta_{sM}$ and $\tau_{sM}$
(and hence $\alpha_{sM}$ and $\beta_{sM}$), $I_s$, and other parameters,

(i) Define a tuning constant $h_M > 0$.

(ii) Draw three values:

$$u \sim \text{Uniform}(0, 1)$$

and

$$(\theta^*, \tau^*)' \sim N_2((\theta_{sM}, \tau_{sM})', h_M \Sigma_M).$$

(iii) Calculate $\alpha^* = e^{\tau^*}\, \text{logit}^{-1}(\theta^*)$ and $\beta^* = e^{\tau^*}\, (1 - \text{logit}^{-1}(\theta^*))$.

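(iv) Calculate

$$r = \min\left(1,\; \frac{f_B(p_{sM} \mid \alpha^*, \beta^*)\; \phi_2((\theta^*, \tau^*)' \mid (\mu_{\theta M}, \mu_{\tau M})', \Sigma_M)}{f_B(p_{sM} \mid \alpha_{sM}, \beta_{sM})\; \phi_2((\theta_{sM}, \tau_{sM})' \mid (\mu_{\theta M}, \mu_{\tau M})', \Sigma_M)}\right),$$

where $f_B(\cdot \mid \alpha, \beta)$ denotes the Beta density and $\phi_2(\cdot \mid \mu, \Sigma)$ denotes the bivariate normal density.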

(v) If $u < r$, let $\theta_{sM} = \theta^*$ and $\tau_{sM} = \tau^*$.
Otherwise, let $\theta_{sM}$ and $\tau_{sM}$ remain the same.

(d). For $s = 1, \ldots, S$ and $k = 1, \ldots, K$, use the Metropolis-Hastings algorithm to draw values of
hyperparameters $\theta_{sMk}$ and $\tau_{sMk}$. Specifically, given current values of $\theta_{sMk}$ and $\tau_{sMk}$ (and hence
$\alpha_{sMk}$ and $\beta_{sMk}$), $I_s$, and other parameters, follow the steps outlined in step (c) above but with all
$M$ indexes replaced by $Mk$'s. The tuning parameter $h_{Mk} > 0$ needs to be chosen.

(e). For $s = 1, \ldots, S$ and $k = 1, \ldots, K$, use the Metropolis-Hastings algorithm to draw values of
hyperparameters $\theta_{sUk}$ and $\tau_{sUk}$. Specifically, given current values of $\theta_{sUk}$ and $\tau_{sUk}$ (and hence
$\alpha_{sUk}$ and $\beta_{sUk}$), $I_s$, and other parameters, follow the steps outlined in step (c) above but with all
$M$ indexes replaced by $Uk$'s. The tuning parameter $h_{Uk} > 0$ needs to be chosen.

The tuning parameters $h_M$ and, for $k = 1, \ldots, K$, $h_{Mk}$ and $h_{Uk}$ are chosen so that the drawn values of
the parameters are accepted approximately 35% of the time (Gelman et al. 2004, chapter 11.9). The
algorithm could be run for several iterations to assess the acceptance rate, adapting the tuning parameters
as necessary. A second phase then could be initiated with fixed values for tuning parameters.
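
One Metropolis-Hastings update of a $(\theta, \tau)$ pair, together with a crude acceptance-rate check, can be sketched as follows. The sketch assumes that the full conditional of $(\theta, \tau)$ is proportional to the Beta density of the block's current probability times the bivariate normal hyperprior, and that the symmetric proposal has covariance $h$ times the hyperprior covariance. The function names, the numerical hyperparameter values, and the fixed current probability `p = 0.03` are illustrative assumptions.

```python
import math
import numpy as np

def to_alpha_beta(theta, tau):
    """Map (theta, tau) back to Beta parameters (alpha, beta)."""
    mean = 1.0 / (1.0 + math.exp(-theta))
    size = math.exp(tau)
    return size * mean, size * (1.0 - mean)

def log_beta_pdf(p, a, b):
    """Log density of Beta(a, b) evaluated at p."""
    return (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
            + (a - 1.0) * math.log(p) + (b - 1.0) * math.log1p(-p))

def log_target(theta, tau, p, mu, cov_inv):
    """Log full conditional of (theta, tau), up to an additive constant."""
    a, b = to_alpha_beta(theta, tau)
    d = np.array([theta, tau]) - mu
    return log_beta_pdf(p, a, b) - 0.5 * float(d @ cov_inv @ d)

def mh_step(theta, tau, p, mu, cov, h, rng):
    """One random-walk update; returns (theta, tau, accepted)."""
    cov_inv = np.linalg.inv(cov)
    theta_new, tau_new = rng.multivariate_normal([theta, tau], h * cov)
    log_r = (log_target(theta_new, tau_new, p, mu, cov_inv)
             - log_target(theta, tau, p, mu, cov_inv))
    if rng.random() < math.exp(min(0.0, log_r)):
        return theta_new, tau_new, True
    return theta, tau, False

# Illustrative hyperparameter values and a fixed current probability.
mu = np.array([-3.48, 7.06])
cov = np.array([[0.059, 0.03], [0.03, 0.23]])
rng = np.random.default_rng(0)
theta, tau, accepted = mu[0], mu[1], 0
n_steps = 2000
for _ in range(n_steps):
    theta, tau, ok = mh_step(theta, tau, 0.03, mu, cov, 0.5, rng)
    accepted += ok
acceptance_rate = accepted / n_steps
```

If `acceptance_rate` is far from the target, the tuning constant `h` would be decreased (to accept more often) or increased (to accept less often) before a second phase with fixed tuning.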



4 Simulation 

One thousand replications are performed under each of two sets of conditions. In one set of conditions, 
the probabilities of agreement are constant across blocks. In the second set of conditions, the probabilities 
of agreement vary by block. Blocks are assumed to be linked together correctly, as they would be if they 
correspond to geographical areas. Pairs from different blocks are nonlinks and not used to estimate 
probabilities. Files A and B both have 10,000 records organized into 400 blocks of size 25 each. This 
arrangement yields 250,000 (400 times $25^2$) pairs of records.

In the first scenario, probabilities of matching and not matching are constant across blocks. The seven 
matching variables have probabilities of agreement among matches of 0.90 to 0.60 in increments of 0.05. 
The probabilities of agreement among nonmatches are 0.5 for two variables, 0.33 for three variables, and
0.25 for two variables. Agreements on the fields of information are independent of one another.

In the second scenario, probabilities of matching and not matching vary across blocks. The seven
matching variables in each block have probabilities of agreement among matches randomly selected from
the range 0.60 to 0.90. The probabilities of agreement among nonmatches are randomly selected from
0.20 to 0.50. As before, agreements on the fields of information are independent of one another.

The files A and B were generated and comparison vectors calculated. 
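
A minimal sketch of generating comparison vectors directly for the first scenario is given below. Several details are assumptions made for illustration because they are not stated above: the number of true matches per block (`n_matches=20`), the placement of matched records at the same position in both files, and the assignment of the nonmatch agreement probabilities to particular fields.

```python
import numpy as np

def simulate_blocks(n_blocks, block_size, m_probs, u_probs, n_matches, rng):
    """Simulate comparison vectors and true match status within blocks."""
    m_probs = np.asarray(m_probs, dtype=float)
    u_probs = np.asarray(u_probs, dtype=float)
    K = len(m_probs)
    gammas, truth, block_ids = [], [], []
    for s in range(n_blocks):
        # True match indicators I_s: matched records share a position.
        ind = np.zeros((block_size, block_size), dtype=bool)
        idx = rng.permutation(block_size)[:n_matches]
        ind[idx, idx] = True
        # Agreement probability per pair and field, by match status.
        probs = np.where(ind[..., None], m_probs, u_probs)
        g = rng.random((block_size, block_size, K)) < probs
        gammas.append(g.reshape(-1, K).astype(int))
        truth.append(ind.ravel())
        block_ids.append(np.full(block_size ** 2, s))
    return np.vstack(gammas), np.concatenate(truth), np.concatenate(block_ids)

rng = np.random.default_rng(2013)
m_probs = np.linspace(0.90, 0.60, 7)              # 0.90, 0.85, ..., 0.60
u_probs = [0.5, 0.5, 0.33, 0.33, 0.33, 0.25, 0.25]
gamma, is_match, block = simulate_blocks(400, 25, m_probs, u_probs, 20, rng)
```

For the second scenario, `m_probs` and `u_probs` would instead be redrawn for each block from the stated ranges inside the loop.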

Four statistical procedures were used to do the record linkage. The first is latent class analysis (LCA). 
The second is the Bayesian latent class model (BLCM) with a chosen prior distribution. The third is the 
BLCM with a prior distribution based on a sample of similar cases. The fourth is the hierarchical LCA. 

For the LCA, the EM algorithm (Dempster, Laird, and Rubin 1977) was used to fit a two-class conditional
independence mixture model to the comparison vectors to estimate probabilities for the Fellegi-Sunter 
(1969) algorithm. 

In the first Bayesian LCM, Beta prior distributions are chosen so that the probability of agreement
among matches, $\Pr(\gamma_k = 1 \mid M)$, is most likely between 0.65 and 0.95, the probability among nonmatches,
$\Pr(\gamma_k = 1 \mid U)$, tends to be between 0.10 and 0.40, and the probability of a matching pair, $\Pr(M)$, is likely
between 0.02 and 0.04. If the mean of a Beta distribution is 0.80 and its standard deviation (SD) is 0.075,
then its parameters are approximately $\alpha_{Mk} = 22.0$ and $\beta_{Mk} = 5.4$. A Beta distribution with mean 0.25
and SD 0.075 has parameters approximately $\alpha_{Uk} = 8.1$ and $\beta_{Uk} = 24.2$. A Beta distribution with mean
0.03 and SD 0.005 has parameters approximately $\alpha_M = 35$ and $\beta_M = 1128$. The prior distribution for
$\Pr(M)$ is narrow because logically the proportion of matches cannot exceed 4%: 10,000/250,000 =
1/25 = 0.04.
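
The Beta parameters quoted above follow from matching the mean and standard deviation. For a Beta distribution with mean $\mu$ and variance $v$, $\alpha + \beta = \mu(1 - \mu)/v - 1$, $\alpha = \mu(\alpha + \beta)$, and $\beta = (1 - \mu)(\alpha + \beta)$. A short sketch, with an illustrative function name:

```python
def beta_from_mean_sd(mean, sd):
    """Return (alpha, beta) of the Beta distribution with this mean and SD."""
    size = mean * (1.0 - mean) / sd ** 2 - 1.0  # alpha + beta
    if size <= 0:
        raise ValueError("SD too large for a Beta distribution with this mean")
    return mean * size, (1.0 - mean) * size

a_mk, b_mk = beta_from_mean_sd(0.80, 0.075)  # agreement among matches
a_uk, b_uk = beta_from_mean_sd(0.25, 0.075)  # agreement among nonmatches
a_m, b_m = beta_from_mean_sd(0.03, 0.005)    # probability of a match
```

The three sums `a + b` are approximately 27.4, 32.3, and 1163, which act as prior "sample sizes".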

In the second Bayesian LCM, it is assumed that there are complete data from two blocks. That is,
match status is known for all 1,250 pairs in the two blocks. Results of these comparisons are used as
prior information. The numbers of matching and nonmatching pairs determine $\alpha_M$ and $\beta_M$, respectively.
The numbers of agreements and disagreements on field $k$ ($k = 1, \ldots, K$) among matches produce $\alpha_{Mk}$
and $\beta_{Mk}$, respectively. Similarly, the numbers of agreements and disagreements on field $k$ ($k = 1, \ldots, K$)
among nonmatches produce $\alpha_{Uk}$ and $\beta_{Uk}$, respectively. The sum of $\alpha_M$ and $\beta_M$ will be 1,250, which
is comparable to their sum in the first BLCM approach.

In the hierarchical LCA, it is necessary to select hyperprior distributions and tuning constants for the
Metropolis-Hastings steps. If one uses the means of the prior distributions from the first Bayesian LCM
formulation for the hyperprior means, then $\mu_{\theta Mk} = \text{logit}(0.80) = 1.39$, $\mu_{\theta Uk} = \text{logit}(0.25) = -1.10$,
$\mu_{\theta M} = \text{logit}(0.03) = -3.48$, $\mu_{\tau Mk} = \log 27.4 = 3.3$, $\mu_{\tau Uk} = \log 32.3 = 3.5$, and $\mu_{\tau M} = \log 1163 =
7.1$.

Hyperprior variances and covariances describe spread and correlation of $\alpha$ and $\beta$ values (transformed
to $\theta$ and $\tau$) across blocks. An ad hoc way of specifying these values is suggested. Among matches under
the prior distribution from the first BLCM, one expects probabilities to be between 0.65 and 0.90. On
the logit scale, 0.65 is 0.61 and 0.90 is 2.20. One-twelfth of the range between 0.61 and 2.20,
(2.20-0.61)/12 = 0.1325, is used as the variance (the variance of a uniform distribution on this interval
would use the squared range). The prior "sample size" from the first BLCM was 27.4. One quarter of this is
6.85, which is 1.92 on the log scale. Four times this value is 109.60, which is 4.70 on the log scale.
One-twelfth of the range between 1.92 and 4.70 is (4.70-1.92)/12 = 0.23. Thus $\sigma^2_{\theta Mk} = 0.1325$ and
$\sigma^2_{\tau Mk} = 0.23$ for $k = 1, \ldots, K$.

Similar methods can be used for the other variances. The logit of 0.10 is -2.20. The logit of 0.40
is -0.41. One-twelfth of the range between these limits is 0.1492. One quarter of 32.3 is 8.1,
which is 2.1 on the log scale. Four times 32.3 is 129.2, which is 4.86 on the log scale. One-twelfth
of the range between these limits is 0.23. Thus $\sigma^2_{\theta Uk} = 0.1492$ and $\sigma^2_{\tau Uk} = 0.23$ for $k = 1, \ldots, K$. The
logits of 0.02 and 0.04 are -3.89 and -3.18, respectively. One-twelfth that range is 0.059. One quarter of
1163 is 291, which is 5.67 on the log scale. Four times it is 4652, which is 8.45 after log transformation.
One-twelfth the range is 0.23. Thus $\sigma^2_{\theta M} = 0.059$ and $\sigma^2_{\tau M} = 0.23$.

For the covariance terms between $\theta$ and $\tau$ parameters, an ad hoc process was followed. Values of $\alpha$
and $\beta$ were computed for a range of Beta distributions with probabilities in various ranges. These values
were transformed to $\theta$ and $\tau$. The covariance among them was computed. This was repeated for the
match probability and probabilities of agreement among matches and among nonmatches. The values
obtained are as follows: $\sigma_{\theta Mk, \tau Mk} = -0.08$, $\sigma_{\theta Uk, \tau Uk} = -0.01$, and $\sigma_{\theta M, \tau M} = 0.03$.

Choices for the tuning constants ($h_M$ and, for $k = 1, \ldots, K$, $h_{Mk}$ and $h_{Uk}$) also need to be made.
Initially these will be set to 0.5. Depending on acceptance rates in the initial set of Metropolis-Hastings
steps, these values will be reassessed. The algorithm could use $2K + 1$ different values to improve
algorithm performance.

Work on simulations is underway and should be completed in early 2013. 

5 Conclusions and Future Work 

A novel hierarchical Bayesian model for record linkage has been presented. The model allows proba- 
bilities to vary by block and reflect local information. Simulations are being completed to evaluate the 
performance of the proposed methods. 

Several areas can be identified for future work. Many of these will be important in actual applications. 
It will be interesting to apply these methods to data from the U.S. Census Bureau, the U.S. National 
Center for Health Statistics, and other sources. An automated system for applying these models to new 
sets of files would be useful in this regard. In a real application, one could consider better specifications 
of prior distributions for the record linkage model parameters. In particular, if data are available from 
another record linkage site and the site differs in some ways from the current application, then one must 
decide the degree to which data from the previous site should be discounted or down weighted when 
analyzing the new site. In some applications, the size of the files will be a challenge. In order to speed 
computations, one might consider parallel computations; for example, many computations are performed 
separately in each block. 

The algorithm's performance could be improved by studying tuning parameters and the order of sam- 
pling cycles within Metropolis-Hastings and Gibbs sampling algorithms. One could study the sensitivity 
of results to the specification of hyperprior distributions. 

Larsen (2004, 2005) considered one-to-one restrictions enforced in the indicator matrix $I$ and in
the statistical likelihood function. The one-to-one restrictions and blocking assumptions mean that
$\sum_{b_s} I(a_s, b_s) \le 1$, $\sum_{a_s} I(a_s, b_s) \le 1$, and $\sum_{a_s} \sum_{b_{s'}} I(a_s, b_{s'}) = 0$ for $s \ne s'$. The number of matches in
block $s$, $n_{ms}$, is defined and restricted under one-to-one matching as follows:

$$\sum_{a_s} \sum_{b_s} I(a_s, b_s) = n_{ms} \le \min(n_{as}, n_{bs}).$$

Future work will pursue Metropolis-Hastings steps for sampling new values of I instead of the current 
simpler formulation. 
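
As a small illustration of the one-to-one restrictions, the following sketch checks whether a block's indicator matrix has at most one match per record in each file. The function name and the example matrices are hypothetical.

```python
import numpy as np

def is_one_to_one(ind_s):
    """Check that each record in either file has at most one match."""
    ind_s = np.asarray(ind_s, dtype=int)
    rows_ok = (ind_s.sum(axis=1) <= 1).all()  # each a_s matches <= 1 b_s
    cols_ok = (ind_s.sum(axis=0) <= 1).all()  # each b_s matches <= 1 a_s
    return bool(rows_ok and cols_ok)

# Block with 2 records in file A (rows) and 3 records in file B (columns).
valid = [[1, 0, 0],
         [0, 0, 1]]
invalid = [[1, 1, 0],
           [0, 0, 1]]
n_ms = int(np.sum(valid))  # number of matches in the valid block
```

A configuration that passes this check automatically satisfies the bound on the number of matches in the block.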